Automatic Word Spacing Using Hidden Markov Model for Refining Korean Text Corpora
Abstract
This paper proposes a word spacing model using a hidden Markov model (HMM) for refining Korean raw text corpora. Previous statistical approaches to automatic word spacing have relied on inaccurate probabilities because they do not consider the previous spacing state. We treat word spacing as a classification problem, like part-of-speech (POS) tagging, and have experimented with various models that consider extended context. Experimental results show that the model's performance improves as more context is considered. When the same number of parameters is used as in other methods, our model proves more effective, yielding better results.
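To make the idea of conditioning on the previous spacing state concrete, here is a minimal sketch of a first-order HMM spacer. It is not the paper's model: the tag set (1 = a space follows the syllable, 0 = no space), the add-one smoothing, and the names `to_tagged`, `train`, `log_prob`, and `space` are all assumptions made for illustration, and the extended-context variants mentioned in the abstract are not reproduced.

```python
import math
from collections import defaultdict


def to_tagged(sentence):
    """Turn a spaced sentence into (character, tag) pairs; tag 1 means a space follows."""
    chars = list(sentence.strip())
    pairs = []
    for i, ch in enumerate(chars):
        if ch == " ":
            continue
        follows_space = i + 1 < len(chars) and chars[i + 1] == " "
        pairs.append((ch, 1 if follows_space else 0))
    return pairs


def train(sentences):
    """Count tag transitions and character emissions from correctly spaced text."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][character]
    vocab = set()
    for sentence in sentences:
        prev = 1  # assumption: sentence start behaves like "after a space"
        for ch, tag in to_tagged(sentence):
            trans[prev][tag] += 1
            emit[tag][ch] += 1
            vocab.add(ch)
            prev = tag
    return trans, emit, vocab


def log_prob(table, given, outcome, n_outcomes):
    """Add-one smoothed log P(outcome | given)."""
    total = sum(table[given].values())
    return math.log((table[given][outcome] + 1) / (total + n_outcomes))


def space(text, model):
    """Insert spaces into text using Viterbi decoding over the two spacing tags."""
    trans, emit, vocab = model
    chars = [c for c in text if c != " "]
    if not chars:
        return ""
    v = len(vocab) + 1  # one extra slot for unseen characters
    score = {t: log_prob(trans, 1, t, 2) + log_prob(emit, t, chars[0], v)
             for t in (0, 1)}
    back = []
    for ch in chars[1:]:
        new_score, pointers = {}, {}
        for t in (0, 1):
            # The best previous tag depends on the transition into t,
            # which is where the previous spacing state enters.
            best = max((0, 1), key=lambda p: score[p] + log_prob(trans, p, t, 2))
            new_score[t] = (score[best] + log_prob(trans, best, t, 2)
                            + log_prob(emit, t, ch, v))
            pointers[t] = best
        back.append(pointers)
        score = new_score
    tag = max((0, 1), key=lambda t: score[t])
    tags = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        tags.append(tag)
    tags.reverse()
    return "".join(ch + (" " if t == 1 else "") for ch, t in zip(chars, tags)).strip()
```

A model could be built with `train(["나는 학교에 간다", "나는 집에 간다"])` and applied with `space("나는학교에간다", model)`. With so little training data the output is only illustrative; the point is that each spacing decision is scored jointly with the one before it rather than in isolation.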
Similar resources
A Statistical Model for Automatic Extraction of Korean Transliterated Foreign Words
In this paper, we will describe a Korean transliterated foreign word extraction algorithm. In the proposed method, we reformulate the foreign word extraction problem as a syllable-tagging problem such that each syllable is tagged with a foreign syllable tag or a pure Korean syllable tag. Syllable sequences of Korean strings are modelled by Hidden Markov Model whose state represents a character ...
Hidden Markov Model-Based Korean Part-of-Speech Tagging Considering High Agglutinativity, Word-Spacing, and Lexical Correlativity
Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors
Generative probabilistic models have been used for content modelling and template induction, and are typically trained on small corpora in the target domain. In contrast, vector space models of distributional semantics are trained on large corpora, but are typically applied to domain-general lexical disambiguation tasks. We introduce Distributional Semantic Hidden Markov Models, a novel variant ...
Word and Sentence Tokenization with Hidden Markov Models
We present a novel method for the segmentation of text into tokens and sentences. Our approach makes use of a Hidden Markov Model for the detection of segment boundaries. Model parameters can be estimated from pre-segmented text which is widely available in the form of treebanks or aligned multi-lingual corpora. We formally define the boundary detection model and evaluate its performance on cor...
Publication date: 2002